[Models] Add SharedFusedMoE support to Qwen3MoE #32082

ywang96 merged 11 commits into vllm-project:main from

Conversation
Code Review
This pull request adds support for SharedFusedMoE to Qwen3MoE, which is a valuable enhancement for handling models with shared experts. The overall approach is sound, but I've identified two critical issues that would lead to runtime errors. One is an incorrect method call with an undefined argument, and the other is improper handling of the return value from SharedFusedMoE.forward. Please see my detailed comments for suggestions on how to fix these issues.
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
/gemini review
Code Review
This pull request adds support for shared experts in Qwen3MoE by integrating SharedFusedMoE. The changes correctly set up the shared expert and its gating mechanism. However, I've found a critical issue in the tensor parallelism logic where the final output is not correctly reduced across tensor parallel ranks, which will lead to incorrect model outputs when tp_size > 1. I've provided a fix for this issue.
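A minimal sketch of the kind of fix this review is pointing at, assuming vLLM's distributed helpers (`tensor_model_parallel_all_reduce`, `get_tensor_model_parallel_world_size`); the `_finalize_moe_output` helper is hypothetical and the actual PR diff may differ:

```python
# Hedged sketch, not the PR's exact code: with reduce_results=False on the
# experts, the combined output must be all-reduced across TP ranks before it
# leaves the MoE block, otherwise outputs are wrong when tp_size > 1.
import torch

from vllm.distributed import (get_tensor_model_parallel_world_size,
                              tensor_model_parallel_all_reduce)


def _finalize_moe_output(shared_out: torch.Tensor | None,
                         fused_out: torch.Tensor) -> torch.Tensor:
    """Combine shared-expert and routed-expert outputs, then reduce over TP."""
    final = fused_out if shared_out is None else shared_out + fused_out
    if get_tensor_model_parallel_world_size() > 1:
        final = tensor_model_parallel_all_reduce(final)
    return final
```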
hidden_act: str,
quant_config: QuantizationConfig | None = None,
reduce_results: bool = True,
expert_gate: torch.nn.Linear | None = None,
Incorrect type hint causes wrong tensor indexing
Low Severity
The expert_gate parameter is typed as torch.nn.Linear | None but is actually a ReplicatedLinear. The code does self.expert_gate(x)[0] to extract the output from a tuple returned by ReplicatedLinear.forward. However, torch.nn.Linear.forward returns a tensor directly, not a tuple, so [0] would incorrectly index into the first dimension of the tensor instead of extracting the output. While the current code works because only ReplicatedLinear is passed, the type annotation is misleading and using torch.nn.Linear as documented would produce silently incorrect results.
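A minimal sketch of the distinction this comment draws, using a hypothetical `gate_scores` helper: vLLM's `ReplicatedLinear.forward` returns an `(output, output_bias)` tuple, while `torch.nn.Linear` returns the tensor directly, so the same `[0]` indexing would slice the batch dimension instead.

```python
# Illustrative sketch of the reviewer's point, not code from the PR.
import torch


def gate_scores(expert_gate: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    out = expert_gate(x)
    # vLLM linear layers such as ReplicatedLinear return (output, output_bias),
    # so the caller must unpack the tuple; a plain torch.nn.Linear returns the
    # tensor itself, and indexing it with [0] silently selects the first row.
    if isinstance(out, tuple):
        out = out[0]
    return torch.sigmoid(out)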
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Hi @Isotr0py, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: 陈建华 <1647430658@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
…vllm-project#6335)

### What this PR does / why we need it?
PR vllm-project/vllm#32082 in vLLM makes Qwen3-MoE models also go through `SharedFusedMoE`, while the current implementation of our `AscendSharedFusedMoE` assumes shared experts always exist. This PR adds checks so that `multistream_overlap_shared_expert` and `multistream_overlap_gate` are only enabled when shared experts exist.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
All CI passed

- vLLM version: v0.14.1
- vLLM main: vllm-project/vllm@dc917cc

Signed-off-by: whx-sjtu <2952154980@qq.com>
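A hedged sketch of the guard this description refers to; the class body and config attribute names are illustrative rather than vllm-ascend's actual code:

```python
# Illustrative sketch: only enable the multistream-overlap features when shared
# experts actually exist, since Qwen3-MoE now uses SharedFusedMoE without them.
class AscendSharedFusedMoESketch:
    def __init__(self, shared_experts, ascend_config):
        self._shared_experts = shared_experts
        has_shared = shared_experts is not None
        self.multistream_overlap_shared_expert = (
            ascend_config.multistream_overlap_shared_expert and has_shared)
        self.multistream_overlap_gate = (
            ascend_config.multistream_overlap_gate and has_shared)
```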
### What this PR does / why we need it?
This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This involves:
- Updating the `VLLM_TAG` in all `Dockerfile`s.
- Updating the vLLM version in `docs/source/conf.py`.
- Removing conditional code paths specific to `v0.14.1` across the codebase, which simplifies maintenance.
- Fixing `TypeError: MMEncoderAttention.__init__() got an unexpected keyword argument 'multimodal_config'` due to vllm-project/vllm#31972.
- Fixing `_shared_experts: 'NoneType' object is not callable` due to vllm-project/vllm#32082 by vllm-project#6335.
- Fixing `ReshapeAndCacheOperation setup failed!` due to vllm-project/vllm#25954 by overriding attention metadata slots.

This upgrade is necessary to keep the project aligned with the latest features, bug fixes, and API changes in the vLLM project.

### Does this PR introduce _any_ user-facing change?
No, this is an internal dependency update and does not introduce any user-facing changes.

### How was this patch tested?
CI is expected to pass with these changes, ensuring that all existing tests are successful with the new vLLM version.

- vLLM version: v0.14.1
- vLLM main: vllm-project/vllm@dc917cc

Co-authored-by: shen-shanshan <467638484@qq.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Purpose
Test Plan
vllm side with:
vllm-omni side with https://github.com/Isotr0py/vllm-omni/tree/check-qwen3-omni-moe-talker:
Test Result
Test with Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 with tp_size=2:

Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.
Note
Introduces shared-expert support in Qwen3MoE using `SharedFusedMoE`.
- Replace `FusedMoE` with `SharedFusedMoE`; add `gate`, an optional `shared_expert_gate`, and a `shared_expert` MLP (controlled by `shared_expert_intermediate_size`)
- Update `forward` in the sparse MoE block to compute `shared_out` and `fused_out`, sum them when present, and perform a TP all-reduce when needed
- Update `Qwen3MoeMLP` to accept an optional `expert_gate` and apply sigmoid gating to its outputs
- Use `SharedFusedMoE.make_expert_params_mapping`; set `reduce_results=False` to defer the reduction

Written by Cursor Bugbot for commit bf79a7a. This will update automatically on new commits.
Note
Introduces shared-expert support in Qwen3MoE via `SharedFusedMoE` and optional shared MLP gating.
- Replace `FusedMoE` with `SharedFusedMoE` in the sparse MoE block; add `gate`, an optional `shared_expert_gate`, and a `shared_expert` MLP (controlled by `shared_expert_intermediate_size`)
- Update `forward` to compute `shared_out` and `fused_out`, sum them when present, and perform a TP all-reduce when not sequence-parallel; set `reduce_results=False`
- Update `Qwen3MoeMLP` to accept `expert_gate` and apply sigmoid gating to its outputs
- Use `SharedFusedMoE.make_expert_params_mapping` for expert weight loading

Written by Cursor Bugbot for commit f4ebf48. This will update automatically on new commits.
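As a rough sketch of the forward path the notes above describe, assuming `SharedFusedMoE` returns a `(shared_output, routed_output)` pair when a shared expert is configured; the attribute names (`gate`, `experts`, `tp_size`, `use_sequence_parallel`) are illustrative, not the PR's exact code:

```python
# Hedged sketch of the sparse MoE block forward described in the notes above.
import torch

from vllm.distributed import tensor_model_parallel_all_reduce


def moe_block_forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
    # Router logits from the gate; vLLM linear layers return (output, bias).
    router_logits, _ = self.gate(hidden_states)
    out = self.experts(hidden_states=hidden_states, router_logits=router_logits)
    if isinstance(out, tuple):
        # Shared expert is configured: combine its output with the routed one.
        shared_out, fused_out = out
        final = shared_out + fused_out
    else:
        final = out
    # With reduce_results=False on the experts, the TP reduction happens here.
    if self.tp_size > 1 and not self.use_sequence_parallel:
        final = tensor_model_parallel_all_reduce(final)
    return final
```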